[ETL-631] Update index fields with ParticipantIdentifier and propagate participant id fields #110

Merged
philerooski merged 1 commit into main from etl-631 on Apr 11, 2024

Conversation

philerooski
Contributor

  • Update index fields
  • If parent parquet dataset has ParticipantID field, propagate that to the child
  • Small change to the FitbitEcg schema to include the export start and end date fields

And a small update to FitbitEcg table schema that should have
been included when we originally added this data type.
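
As a rough illustration of the first two bullets, the field selection for a child table might look like the sketch below. It uses plain Python lists rather than the DataFrame objects that appear in the diff. `child_table_fields` and its parameters are hypothetical names, and the two `INDEX_FIELD_MAP` entries are an assumed excerpt of the updated map, not the full map.

```python
# Assumed excerpt of the updated index map (illustrative, not the full map).
INDEX_FIELD_MAP = {
    "garminhrvsummary": ["ParticipantIdentifier", "StartTimeInSeconds"],
    "garminpulseoxsummary": ["ParticipantIdentifier", "SummaryId"],
}


def child_table_fields(table_data_type, parent_columns, original_field_name):
    """Return the parent columns to carry over to a child table (hypothetical helper)."""
    index_fields = INDEX_FIELD_MAP[table_data_type]
    additional_fields = [original_field_name, "cohort"]
    # Propagate ParticipantID only when the parent dataset actually has it.
    if "ParticipantID" in parent_columns:
        additional_fields.append("ParticipantID")
    # Concatenation builds a new list, so the shared map is never mutated.
    return index_fields + additional_fields
```

With this shape, a parent without a ParticipantID column simply produces a child without one, instead of failing on a missing column.
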
philerooski requested a review from a team as a code owner on April 10, 2024 21:59

sonarcloud bot commented Apr 10, 2024

Quality Gate passed

Issues
0 New issues
0 Accepted issues

Measures
0 Security Hotspots
No data about Coverage
0.0% Duplication on New Code

See analysis details on SonarCloud

Contributor

BryanFauble left a comment

LGTM!

"garminhrvsummary": ["ParticipantID", "StartTimeInSeconds"],
"garminmanuallyupdatedactivitysummary": ["ParticipantID", "SummaryId"],
"garminmoveiqactivitysummary": ["ParticipantID", "SummaryId"],
"garminpulseoxsummary": ["ParticipantID", "SummaryId"],

Member

Was ParticipantID the wrong key?

Contributor Author

There's a one-to-one mapping from ParticipantIdentifier to ParticipantID. They are two different ways of representing the same identifier.

ParticipantIdentifier is the only field present in every data type (including SymptomLog) and the CE docs suggest it's the more "official" identifier -- kind of like HealthCode to ExternalId in mPower.
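
The one-to-one claim is easy to spot-check on any dataset that carries both fields. Here is a minimal sketch, assuming the records are available as plain dictionaries; `is_one_to_one` is a hypothetical helper, not something in this repo.

```python
def is_one_to_one(records, left="ParticipantIdentifier", right="ParticipantID"):
    """Return True if each `left` value pairs with exactly one `right` value and vice versa."""
    left_to_right = {}
    right_to_left = {}
    for record in records:
        left_value, right_value = record[left], record[right]
        # setdefault stores the first pairing seen; any later mismatch breaks the mapping.
        if left_to_right.setdefault(left_value, right_value) != right_value:
            return False
        if right_to_left.setdefault(right_value, left_value) != left_value:
            return False
    return True
```

Checking both directions matters: a many-to-one mapping would pass a check that only looked from ParticipantIdentifier to ParticipantID.
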

Member

thomasyu888 left a comment

🔥 Thanks for the quick work here!

).distinct()
index_fields = INDEX_FIELD_MAP[table_data_type]
additional_fields = [selectable_original_field_name, "cohort"]
if "ParticipantID" in parent_table.columns:

Contributor

rxu17 Apr 11, 2024

Should we modify or add a test for this function to check that ParticipantID gets included in additional_fields when it exists? I was reading the JIRA ticket and Slack thread: why would both ParticipantID and ParticipantIdentifier exist in a dataset?

Contributor Author

Ahh, yes we should. I'm always forgetting about tests 🤦
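
A test along these lines could cover both branches. This is only a sketch: `select_additional_fields` is a hypothetical stand-in for the relevant part of the real function, it takes a list of column names instead of a DataFrame, and "Value" is a made-up field name.

```python
def select_additional_fields(parent_columns, original_field_name):
    """Hypothetical stand-in for the field-selection logic under test."""
    additional_fields = [original_field_name, "cohort"]
    if "ParticipantID" in parent_columns:
        additional_fields.append("ParticipantID")
    return additional_fields


def test_participant_id_propagated_when_present():
    parent_columns = ["ParticipantIdentifier", "ParticipantID", "cohort", "Value"]
    assert "ParticipantID" in select_additional_fields(parent_columns, "Value")


def test_participant_id_omitted_when_absent():
    parent_columns = ["ParticipantIdentifier", "cohort", "Value"]
    assert select_additional_fields(parent_columns, "Value") == ["Value", "cohort"]
```
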

Contributor Author

> why would both ParticipantID and ParticipantIdentifier exist in a dataset?

I don't know the exact reason, but it's not unusual to have one be the "global" identifier and the other be a study- or app-specific identifier.

Contributor

That's not confusing at all :D

+ INDEX_FIELD_MAP[table_data_type]
)
).distinct()
index_fields = INDEX_FIELD_MAP[table_data_type]

Contributor

rxu17 Apr 11, 2024

Given that ParticipantIdentifier is now a required index field for every data type (as far as I can tell from INDEX_FIELD_MAP), maybe we could add a test to ensure that ParticipantIdentifier appears in every value of the INDEX_FIELD_MAP dict. This is mostly future-proofing: a double check that later code changes don't accidentally drop it. Thoughts?
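
For concreteness, such a check might look like the sketch below. The map here is an assumed two-entry excerpt rather than the real INDEX_FIELD_MAP, and `data_types_missing_field` is a hypothetical helper name.

```python
# Assumed excerpt of the index map after this change; not the full map.
INDEX_FIELD_MAP = {
    "garminhrvsummary": ["ParticipantIdentifier", "StartTimeInSeconds"],
    "garminmoveiqactivitysummary": ["ParticipantIdentifier", "SummaryId"],
}


def data_types_missing_field(index_field_map, field="ParticipantIdentifier"):
    """Return the sorted data types whose index fields do not include `field`."""
    return sorted(
        data_type
        for data_type, index_fields in index_field_map.items()
        if field not in index_fields
    )


def test_every_data_type_is_indexed_by_participant_identifier():
    # Comparing against an empty list makes a failure name the offending data types.
    assert data_types_missing_field(INDEX_FIELD_MAP) == []
```
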

Contributor Author

I don't want to assume that ParticipantIdentifier will be included with every data type going forward. That's a CE decision, and why I explicitly included the field in the INDEX_FIELD_MAP rather than specifying it once in this function.

Contributor

rxu17 left a comment

LGTM! Thanks for looking into this! Just a few comments.

philerooski merged commit f312f6a into main Apr 11, 2024
15 checks passed
philerooski deleted the etl-631 branch April 11, 2024 19:16
4 participants